conjugate gradient
Regularized Nonlinear Acceleration
Damien Scieur, Alexandre d'Aspremont, Francis Bach
We describe a convergence acceleration technique for generic optimization problems. Our scheme computes estimates of the optimum from a nonlinear average of the iterates produced by any optimization method. The weights in this average are computed via a simple and small linear system, whose solution can be updated online. This acceleration scheme runs in parallel to the base algorithm, providing improved estimates of the solution on the fly, while the original optimization method is running. Numerical experiments are detailed on classical classification problems.
Scalable Principal-Agent Contract Design via Gradient-Based Optimization
Galanti, Tomer, Bookseller, Aarya, Ray, Korok
We study a bilevel \emph{max-max} optimization framework for principal-agent contract design, in which a principal chooses incentives to maximize utility while anticipating the agent's best response. This problem, central to moral hazard and contract theory, underlies applications ranging from market design to delegated portfolio management, hedge fund fee structures, and executive compensation. While linear-quadratic models such as Holmstr"om-Milgrom admit closed-form solutions, realistic environments with nonlinear utilities, stochastic dynamics, or high-dimensional actions generally do not. We introduce a generic algorithmic framework that removes this reliance on closed forms. Our method adapts modern machine learning techniques for bilevel optimization -- using implicit differentiation with conjugate gradients (CG) -- to compute hypergradients efficiently through Hessian-vector products, without ever forming or inverting Hessians. In benchmark CARA-Normal (Constant Absolute Risk Aversion with Gaussian distribution of uncertainty) environments, the approach recovers known analytical optima and converges reliably from random initialization. More broadly, because it is matrix-free, variance-reduced, and problem-agnostic, the framework extends naturally to complex nonlinear contracts where closed-form solutions are unavailable, such as sigmoidal wage schedules (logistic pay), relative-performance/tournament compensation with common shocks, multi-task contracts with vector actions and heterogeneous noise, and CARA-Poisson count models with $\mathbb{E}[X\mid a]=e^{a}$. This provides a new computational tool for contract design, enabling systematic study of models that have remained analytically intractable.
Export Reviews, Discussions, Author Feedback and Meta-Reviews
Submitted by Assigned_Reviewer_1 Q1 The authors propose a flexible and interpretable kernel (the CSM kernel), building on spectral mixture kernels, for learning relationships between multiple tasks. The starting point is to use Gaussian processes with 1 component spectral mixture kernels as the basis functions in a linear model of coregionalisation (SM-LMC). However, SM-LMC does not contain information about the phases between channels. Thus the authors propose the cross spectral mixture kernel, which mixes phase shifted versions of spectral mixture kernels across channels. The resulting kernel is interpretable and flexible.
Fast Gaussian process inference by exact Matรฉrn kernel decomposition
Langrenรฉ, Nicolas, Warin, Xavier, Gruet, Pierre
To speed up Gaussian process inference, a number of fast kernel matrix-vector multiplication (MVM) approximation algorithms have been proposed over the years. In this paper, we establish an exact fast kernel MVM algorithm based on exact kernel decomposition into weighted empirical cumulative distribution functions, compatible with a class of kernels which includes multivariate Matรฉrn kernels with half-integer smoothness parameter. This algorithm uses a divide-and-conquer approach, during which sorting outputs are stored in a data structure. We also propose a new algorithm to take into account some linear fixed effects predictor function. Our numerical experiments confirm that our algorithm is very effective for low-dimensional Gaussian process inference problems with hundreds of thousands of data points. An implementation of our algorithm is available at https://gitlab.com/warin/fastgaussiankernelregression.git.
Comparing regularisation paths of (conjugate) gradient estimators in ridge regression
Hucker, Laura, Reiร, Markus, Stark, Thomas
We consider standard gradient descent, gradient flow and conjugate gradients as iterative algorithms for minimizing a penalized ridge criterion in linear regression. While it is well known that conjugate gradients exhibit fast numerical convergence, the statistical properties of their iterates are more difficult to assess due to inherent nonlinearities and dependencies. On the other hand, standard gradient flow is a linear method with well known regularizing properties when stopped early. By an explicit non-standard error decomposition we are able to bound the prediction error for conjugate gradient iterates by a corresponding prediction error of gradient flow at transformed iteration indices. This way, the risk along the entire regularisation path of conjugate gradient iterations can be compared to that for regularisation paths of standard linear methods like gradient flow and ridge regression. In particular, the oracle conjugate gradient iterate shares the optimality properties of the gradient flow and ridge regression oracles up to a constant factor. Numerical examples show the similarity of the regularisation paths in practice.
Review for NeurIPS paper: Faster Randomized Infeasible Interior Point Methods for Tall/Wide Linear Programs
Weaknesses: This paper provides a nice advance in the theory of infeasible-start long-step IPMs, however the novelty of the approach taken and the relation of the work in the paper to prior work could use further clarity. First, solving regression problems in an A in nearly linear time, when A has many more rows than columns has been the subject of a line of research, e.g. These results, including ones based on the subspace embedding result used in this paper, readily extend to solving linear systems in A T A and this has been used by the Theoretical Computer Science papers mentioned for implementing short step IPMs. Consequently, I think it would have been beneficial to state earlier that the paper is using the known linear system solving machinery of subspace embeddings to build preconditioners (rather than just saying that "Randomized Linear Algebra" is used) and put this in the context of prior work. There may be novelty in the particular way in which the paper is using conjugate gradient and subspace embeddings, however the paper would be strengthened if it articulated how this is different than this previous literature; as the appendix points out, conjugate gradient can be replaced with other iterative methods which possibly puts the approach considered closer to the ones from the literature. In light of the previous paragraph, I think more of the novelty in the paper may lie in exactly how they handle the error from approximate linear system solves in a way sensitive to the design of the preconditioner.
Review for NeurIPS paper: Probabilistic Linear Solvers for Machine Learning
Strengths: EDIT after rebuttal: Thank you authors for clarifying the following: - GP regression on log(Rayleigh_i): satisfactory reply, this algorithm takes into account uncertainty about eigenvalues beyond t 1. - Transfer learning: reusing the posterior covariance as a prior makes the method converge faster than if just the mean is reused. I'm still confused about this, but a little bit less: - Empirical Bayes: is indeed common, and in many applications the prior is updated as more data comes in. For example, in Bayesian optimization, after acquiring an extra point the GP hyperparameters are re-optimized. However, the weird thing here, which the authors clarified in the rebuttal, is that the prior used at each time step *contains future observations in it*. Does this imply that the posterior covariance is impossible to calculate in the middle of the algorithm, before it is terminated and thus we have the full S matrix?
Conjugate-Gradient-like Based Adaptive Moment Estimation Optimization Algorithm for Deep Learning
Tian, Jiawu, Xu, Liwei, Zhang, Xiaowei, Li, Yongqi
These authors contributed equally to this work. Abstract Training deep neural networks is a challenging task. In order to speed up training and enhance the performance of deep neural networks, we rectify the vanilla conjugate gradient as conjugate-gradient-like and incorporate it into the generic Adam, and thus propose a new optimization algorithm named CG-like-Adam for deep learning. Specifically, both the first-order and the second-order moment estimation of generic Adam are replaced by the conjugate-gradient-like. Convergence analysis handles the cases where the exponential moving average coefficient of the first-order moment estimation is constant and the first-order moment estimation is unbiased. Numerical experiments show the superiority of the proposed algorithm based on the CIFAR10/100 dataset. Introduction Deep learning has been used in many aspects, such as recommendation systems [1], natural language processing [2], image recognition [3], reinforcement learning [4], etc. Neural network model is the main research object of deep learning, which includes input layer, hidden layer and output layer. Each layer includes a certain number of neurons, and each neuron is connected with each other in a certain way. The parameters and connection parameters of each neuron determine the performance of the deep learning model.